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Abstract 

Portfolio balancing requires estimates of covariance between asset returns. Returns 
data have histories which greatly vary in length, since assets begin public trading at 
different times. This can lead to a huge amount of missing data — too much for the con- 
ventional imputation-based approach. Fortunately, a well-known factorization of the 
MVN likelihood under the prevailing historical missingness pattern leads to a simple 
algorithm of OLS regressions that is much more r eliable. When t h ere ar e more assets 
than returns, however, OLS becomes unstable. Gramacy et al. (j2008l ) showed how 



classical shrinkage regression may be used instead, thus extending the state of the art 
to much bigger asset collections, with further accuracy and interpretation advantages. 
In this paper, we detail a fully Bayesian hierarchical formulation that extends the 
framework further by allowing for heavy-tailed errors, relaxing the historical missing- 
ness assumption, and accounting for estimation risk. We illustrate how this approach 
compares favorably to the classical one using synthetic data and an investment exercise 
with real returns. An accompanying R package is on CRAN. 

Key words: multivariate, monotone missing data, data augmentation, ridge regres- 
sion, double-exponential, heavy tails, factor model, portfolio balancing 



1 Introduction 



Mean-variance portfolio allocation (e. g. . iMarkowitzl . Il959l ) requires the accurate and tractable 
estimation of the mean return, and the covariance between the returns, of a large number 
of assets. Assets become publicly tradeable at different times, so their return histories can 
greatly vary in length. Aside from a few "gaps", the histories of assets which are publicly 
tradeable at purchase time will exhibit a monotone missingness pattern. For example, Fig- 
ure [1] shows the monthly return availability for 1,200-odd stocks on NYSE & AMEX over 



missingness pattern in financial return data 




Figure 1: Missingness pattern in stock returns on the NYSE & AMEX. The assets (columns) 
are put into increasing order of the number of missing entries. 



29 years, some with as little as 1 years worth of data. The assets along the columns have 
been sorted by the number of missing entries. Besides some "gaps" the boundary between 
the dark and light regions is monotone. 

Generally speaking, inference in the presence of missing data is notorious ly difficult, usu- 
ally requiring hill-cli mbing iterative techniques like expectation maximization f lLittle and Rubin . 



20021 ; IS chafer! . 119971 )) which rapidly lose stability as the level of missingness increases. The 



Bayesian alternative of data augmentation is similarly unsatisfactory. Software packages im- 
plementing such algorithms come with prominently displ ayed warnings of failure when the 



missingness level is above 15% (see lGramacy et all 120081 ) . 



The nice thing about a monotone missingness pattern is that the likelihood has a con- 
venient factorization which makes inference tractable w ithout imputation. Under a multi- 
variate normal (MVN) assumption, a simple algorithm (lAndersenl . 119571 ; IStambaughl . Ii997| ) 
of ordinary least squares (OLS) regressions, one for each asset, yields a maximum likelihood 
estimator (MLE). Unfortunately, there must be fewer stocks than the length of the shortest 
return history, so that the design matrices of the OLS reg ressions are of full ran k. In particu- 
lar, you cannot have more stocks than historical returns. iGramacy et al.l ( 120081 ) showed that 
by replacing the OLS with "parsimonious regressions", e.g., principal components (PCR), 
ridge, lasso, etc., the above algorithm can be applied when there are more assets than his- 
torical returns. This extended the reach of Stambaugh's (1997) methods from dozens to 
thousands of assets, accommodating an essentially arbitrary level of historical missingness. 

In this paper we shall further extend the above parsimonious methodology in several 
directions by taking a fully Bayesian approach. Section [2] recalls the monotone decompo- 
sition and (MLE/Bayesian) inference algorithm. Section [3] reviews approaches to Bayesian 
shrinkage regression that are particularly convenient in this context, and which allow model 
averaging and heavy-tailed errors as minor embellishments. Section H] details how the ben- 
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efits of the Bayesian shrinkage posteriors filter through to inference about the mean vector 
and covariance matrix. It features extensions for data augmentation to deal with "gaps" , and 
allows estimation risk to influence the balanced portfolios — both of which were unavailable 
previously. In Section [5] we apply our methods in a Monte Carlo investment exercise and 
show how they compare favorably to the classical alternatives on real financial returns data. 
The paper concludes with a brief discussion in Section |6j 

The methods that are core to this paper are implem ented in a fully documented R 
( R Development Core Team . 2007 ) package called monomvn f Gramacyl . 2009 ). which is avail- 
able for download on the Comprehensive R Archive Network (CRAN). 



2 Multivariate normal monotone missing data 



We assume that the missingness mechanism is missing completely at random (MCAR). In 
the case of historic al asset returns th is may be a tenuous assumption, but it is convenient 
and common (e.g., IStambaughl . 119971 ) . We work with a n x m data matrix Y that collects 
the historical returns of the assets. Denote y^j = NA if the i th sample (historical return) of 



the j th covariate (asset) is missing; other wise y. 



'■„' 



Formally speakin g, the missingness 



patte rn in Y is said to be monotone [e.g., (jSchaferl . 119971 . Section 6.5.1) or (ILittle and Rubinl . 
20021 . Section 7.4)] if its columns can be re-arranged so that 7^ NA whenever y^j+i 7^ NA. 

We assume throughout that they are indeed arranged in this way, so that when we define 
nj = Y^i=i -"-{s/ii^NA} as f ne number of observed entries in column j, for j = 1, . . . ,m, we 
have that n = m and rij > rij+i. Furthermore, the rows may be arranged according to the 
same property (y^j 7^ NA whenever 2/j+i,j 7^ NA) without loss of generality, so that when we 
define yj = yi-. nj! j we are collecting the entirety of the observed entries in the j th column. 
The pattern that results is illustrated pictorially in Figure [2j For now we will assume that 

(n) 

there are no "gaps". It is also helpful to define Yj = Y^?^ as the nj x j design matrix 
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containing an intercept, and the first Uj rows of the first j — 1 columns of Y; see Figure [2J 
Under the monotone pattern the likelihood f(Y\6) emits a convenient factorization in 
terms of an auxiliary parameterization <fi = $(#). If the rows of Y are i.i.d. MVN, this 
factorization leads to an iterative algorithm for inferring the MLE = (/i, X). These 
assumptions may not b e appropriate for financial returns but they are common to kee p 
inference tractable (e.g., Stambaugh . 1997 ; Chan et al. . 1999 ; Jagannathan and Mai . 2003 ). 



The algorithm, due originally to lAndersenl (119571 ). begins by calculating jli and E^i = b\ 
in the usual way: fi\ = n^ 1 X^=i 2/i,i an d of = n^ 1 Yl^iiViA ~ Ai) 2 - Then, for j — 2, . . . , m 
the MLEs of 6j = (fij, Sjy-j-), j = 2,...,m, can then be obtained via a regression on 
the complete data in columns 1, 



,j — 1, i.e., using the model y^- 



Yjf3j + €j, where 
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Figure 2: Diagram of a monotone missingness pattern with m = 6 covariates, and n 
completely observed samples in y x = The design matrix Y 5 (without an intercept term) 
and the response vector ys for the fifth regression involved in maximizing the likelihood of 
MVN data under a monotone missingness pattern is also shown. Time (t) runs counter to 
the index % so that the most recent historical return is at time t = n, indexed by % = 1. 



{ei,jYiL\ ~ A/"(0, cr|). Here 0J = (/3qj, . . . , and er| are the auxiliary parameters 

When rank(Yj) = j, and particularly when rij > j, MLEs <f>j may be obtained in the 
usual way: fa = (Y/Y,) 'Y ; y, and a) = ^| |y, - Y^.|| 2 = £ ,(.</-, " (^)i-i ^') 2 - 
The components of 6j given 0\.(j-i) = (/*i : (j-i)> an d 0j are then 

A? = + ^l:(j-l),jAl:(j-l) an d ^lyj = 



The X thereby obtained will be positive-definite, as long as rij > j for all j — 1, . . . , m so 
that Yj is of full rank, and YjYj invertible. 



2.1 Bayesian inference 

The Bayesian approach follows naturally from priors on the auxiliary parameters /3 ■ and <r|, 
for j = 1, . . . , m. Samples from the implied posterior of fj,j and Eiyj are then obtained via 
in Eq. (JT|). It may be more desirable to choose priors directly in the natural parameter 
space = (/x, S), but it can be difficult to analytically derive the implied priors for (3j and 

<7~. However, a popul a r non -informative prior used for MVN data, p(fJb, S) oc |S| v 2 /, 
can be shown (jSchaferl . 1 19971 . Section 6.5.3) to imply the (independent) prior(s) p(/3 7 -,cr?) oc 



1 



-m+j 



giving the posterior conditionals: 



V 3 



A/}(/3, 



2| 



lG((rij — m + j — 



l)/2,(|| yj -Y,^|| 2 )/2). 



Stambaughl (119971 ) showed that, under this non-informative prior, it is possible to derive the 
moments of the Bayesian posterior predictive distribution in terms of the MLEs (ft, X) in 
closed form. When these are used in t he mean-variance framework to const ruct portfolios, 
they are said to take estimation risk (IKlein and Bawal . 1 19761 ; iBrownl . Il979l ) into account. 



However, these calculations similarly break down when rij < j. 

Inverted Wishart priors for S are amena ble to tractable posterior infere nce under the 
MVN with monotone missingness ( jLiul . Il993l ). One example is a ridge prior (jSchaferl . 119971 . 
Section 5.2.3), whic h is helpful when m > n and is closely related to ridge regression [see 
Section [3] as used by lGramacy et al.l (120081 ) in </>-space to good effect. This motivates a more 
pragmatic approach to prior selection: a deliberate search for appropriate shrinkage priors 
for the "big p small n" regression problem in 0-space, where the low rank Yj problem is 
manifest. Then, $ _1 completes the description implicitly in 0-space. 



3 Bayesian shrinkage regression 



Here we focus on appropriate regression models for the "big p small n" problem that employ 
shrinkage. The customary formulation is 



y = /3 ln + X/3 + e, 



where 



e~AT n (0,a 2 I, 



(2) 



One typically assumes a standardized nxp design matrix X where the columns are individu- 
ally adjusted to have zero-mean and unit L2-norm. This causes /3q and (3 to be independent a 
posteriori and recognizes that regularized posterior summaries for (3 are not equivariant un- 
der a re-scaling of X. Any such pre-processing must be undone before evaluating 6 = $~ 1 (0) 
at samples of <p = (/3 , /3, cr 2 ). 



Ridge regression and the lasso (e.g., lHastie et all 1200 ll . Section 3.4.3) are classical ap- 
proaches to shrinkage regression that penalize large coefficients: 



5.(9) 

p = argmm 



X/3) T (y-X/3) + A^ 1/3,1 



(3) 



for some A > 0, where the intercept is excluded from penalization via y = y — yl n . Choosing 

i 1 I 1 „ (2) 

q = 2 yields ridge regression (IHoerl and Kennardl . Il970l ) where = (X T X + AI) _1 X T y. 

The lasso ( Tibshirani . 1996 ) corresponds to q — 1. There is no closed form solution for /3 , 
but the entir e path of solutions for all A can be obtained iteratively via the LARS algorithm 
(jEfron et all 120041 ) . Both estimators may be interpreted as the posterior mode under a 
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(2) 



particular prior. For ridge regression the prior is f3 
i.i.d. Laplace (i.e., double-exponential) ir(f3^\a 2 ) = \ 



cr 2 ~ 



Af p (0, a 2 AI p ); for the lasso it is 



(<■/) 



Large values of the penalty parameter A cause the coefficients of (3 to be shrunk towards 

zero. The lasso estimator ^ ^ may have many coefficients shrunk to exactly zero, which is 
convenient for variable selection. Often, A is chosen vi a cross validation (CV ). As a <p- 
space regression for obtaining monotone MVN estimators. Gramacy et al.l ( 120081 1 chose A by 
applying the "one- standard-error" rule ( Hastie et al. . 200 ll . Section 7.10) with CV. 



3.1 Hierarchical models for Bayesian shrinkage regression 



For a fully Bayesian lasso we use the latent variable formulation of iPark and Casellal (120081 ) 
and ICarlin and Poison! (|199ll ) by representing the Laplace as a scale mixture of normals: 



y|/3 ,X,/3,a 2 ~A/;(A,ln + 
r 2 |A~Exp(A 2 /2) 



X/3,a 2 I n ) 

-), P oc 1 



D r = diag(r 1 2 , ...,r } 
a 2 ~ IGK/2 A/2 

A 2 ~G(a A ,6 A ). 



(4) 



IG and G are the rate- and scale-parameterized inverse-gamma and gamma distributions, 



oc cr is obtained with a a 
= 7? = ■ ■ ■ = r 2 , using r 2 



b a = 0, and ridge 
IG(a r /2A/2) (i.e., 
0. Fixing r 2 = oo yields the standard family of 



respectively. The default prior 7r(cr 2 ) 
regression is the special case where r 2 
dropping A 2 ), possibly with a T = b T = 
(improper) priors for linear regression. 

Choosing the prior parameterization for A 2 can be difficult. IPark and Casellal (120081 ) 
note that choosing a\ = b\ = leads to an improper posterior, and suggest some automatic 
alternatives. Another option i s to further expand the hierarchy using a so-called normal- 
gamma (NG) prior for j3 (e.g., iGriffin and Brownl . 120101 ) by specifying 



and 



A 2 h- 



G{a x ,bx/j), 
G( 7 ,A 2 /2). 



where 7 ~ Exp(L 



(5) 



Griffin and Brownl (120101 ) suggest a a = 2, and b\ = M/2, where M is chosen via empirical 
Bayes considerations. Observe that fixing 7 = 1 encodes the specific Laplace prior case. So 
the NG prior is more adaptive than the lasso. This may come in handy when p 3> n, i.e., 
when our prior plays a more important role, or when the posterior drives many /3j's to zero. 
The availability of full conditionals for all of the parameters makes for efficient Gibbs 
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sampling (GS). For the baseline Bayesian lasso model (IPark and Casella . 



20081 ) these are: 



/3ok 2 ,y ~A%,a 2 /n) 

/V, {Tf} p j=1 , y ~ KCp, ^A" 1 ), A = X T X + d; 1 , 3 = A- x x T y 

a 2 \f3, {rf} p j=1 , y ~ IG((a CT + n-l+ p)/2, % + ^)/2), ^ = | |y - X/3| | 2 + /3 T D; 1 /3 



rr 2 |^., a 2 , A ~ Inv-Gauss( ^/AV 2 //? 2 , A 2 ) (6) 

2| 



" 2 |/3„ a 2 , A ~ Inv-Gauss(y / A 2 a 2 //3 2 , A 2 
A 2 |r 2 , . . . , Tp ~ G(a A + jry, b x /i + EjU r j / 2 )- [assuming 7 = 1] 

Using a marginal posterior conditional for a 2 instead can help reduce autocorrelation in 
the Markov chain. Integrating over the posterior conditional for (3 gives cr 2 |r 2 , . . . , r 2 , y ~ 

IG((a CT + n - l)/2, % + fy)/2), where ^ = | |y - X^| | 2 + ^D; 1 ^ = y T y - f A@. 

Under the ridge regression model the posterior conditionals are the same ([6]) except that 
we ignore A 2 and take r 2 ~ IG((a T +p)/2, (b T + a~ 2 f3 J (3)/ 2). Upon fixing r 2 = oo we must 
use A = X T X, subtract p/2 to the rate parameter to the IG conditional(s) for a 2 , and 
ensure a proper posterior with a a > p — n — 1. This may pose a non-trivial restriction on 
the prior when p > n. Under the NG prior 7 may vary, leading to the conditionals 

t 2 \^, a 2 , A 2 , 7 ~ GIG( 7 - 1/2, /3> 2 , A 2 ) (7) 
'A 2 ' 



7 |{r 2 }^ =1 ,A 2 oc 



2 




where GIG(A,x,?/0 is the generalized inverse Gaussian distribution. Griffin and Brownl 



( I2010l ) suggest a random walk Metropolis update for the 7 using proposals 7' = exp{<r 7 z} 



for z ~ A/"(0, 1). These proposals are accepted with probability 

7' -7 



mm 




where 77(7) = 7 _2 exp(— 7 — |^A 2 ) and a 7 is chosen to give an acceptance rate of 20-30%. 

An alter native hierarchical modeling framework for the Bayesian lasso is provided by 

Hans fl2008h . While it does not require p latent r 2 variables, the resulting GS procedure is 



3 

J2 



not fully blocked, and rejection sampling is required for a . "Orthogonalizing" the sampler 
helps mitiga te slow mixing of the un -blocked conditionals. However, we prefer the simpler 



approach of IPark and Casellal ( 120081 ) as it is more readily adaptable to the p ^> n case, to 
model selection, and our extensions to the heavy-tailed errors. 

Whereas the classical lasso has the property that the estimate (3 ^ may have components 
which are zero — in fact, it would never have more than min{p, n — 1} nonzero components — 
samples of (3 from the posterior would never have zeros. So the Bayesian lasso is less useful 
for variable selection. We also note that when p > n — and without the ability to explicitly 
restrict (3 to having at most min-Jjo, n — 1} nonzero components — a proper prior must be 
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used for a 2 or the posterior will be improper. An empirical Bayes remedy that works well 
in this case is to take a small a a , say a a = 3/2, and then set b a so that the (1 — a) part of 
the lG(a a ,b a ) distribution lies at the point y T y (i.e., the MLE under the intercept model) 
via the incomplete gamma inverse function. Another remedy is Bayesian model averaging. 



3.2 Bayesian model selection and averaging 



Although the MAP lasso fit may indeed set some of the coordinates of /3 to zero, this is 
more of a side effect of the soluti on space of the quadratic program fl3]) than the result of a 
deliberate prior modeling choice ([Hand . 120081 ). Bayesians rarely base inference on the MAP; 
it is more natural to select variables by inspecting the posterior model probabilities. 

There are several standard ways of performing Bayesian variable selection in regression 
models that are amenable to GS. They essentially fall into two camps. Loosely, the first 



camp (e.g., iGewekd . Il996t iGeorge and McCullochl . 119931 ) uses a product-space wherein the 



prior for each (3j is augmented to include a point-mass at zero. Inference proceeds by GS 
on each of the condit i onals 0j \ /3_j , y , . . . , j = 1, . . . ,p, which may flop between zero and 
nonzero values. lHansI ( 120081 ) augmented this product space approach to variable selection 
under the Laplace prior by further conditioning; on A. 



The second camp (e.g., iTroughton and Godsilll . 119971 ) is transdimensional in that the 



/3-vector may vary in length while model space is traversed via Reversible Jump (RJ) 



MCM C (iGreen!. Il995l). We p refer this approach for our monotone inference application 
in the IPark and Casellal ( 20081 ) setup. When p ^> n it is implementationally more compact, 



only requiring memory for the (nonzero) /3-components. This represents a big savings when 
simultaneously storing m regression model parameter sets under a prior that restricts the 
model to have at most min{j, rij — 1} (nonzero) coefficients. Also, we like the fully blocked 
samples for the nonzero components of (3 for the within-model moves. 

Suppose that the transdimensional Markov chain is currently visiting some model with 
k nonzero regression coefficients (3 k = (/?i, . . . , (3 k ) using design matrix X^. The columns of 
Xfc should come from a two-way partition (of k and p — k elements) of the p columns of X, 
but they need not coincide with the first k of the p columns. Now consider proposing to add 
a column to a so-called "birth" move. Choose one of the p — k columns of X not present 
in Xfc for addition, thus creating ~X.k+i- By considering the ratio of the marginal posterior 
distributions (integrating out (3 k and /3 fc+1 ) conditional on cr 2 , r 2 , . . . , r k and a new proposed 
T l+\ (which we take from the prior), it can be shown that the transdimensional move may 
be accepted with probability min{l, A^k+i}, where 



.4 





2 

+ 1 


A 


fc+il)^exp< 


2a 2 @k+l^k+lfi k+l | 




A 


-i 

k 


2 exp < 


2^2/3 k Afc^fc 





X 



n(k + l)q(k + l->k) 
7r(k)q(k -)■ k + 1) 



(9) 



and Aj 



(3 k = A k 1 Xjy, with D Tfe = diag(r 2 , . . . , r|). The reverse "death" 
move, of proposing to remove one of the columns of X^, may be accepted with probability 



X^X. + D- 1 



S 



min{l, A k \_^^\. Under the ridge prior, 



'fc+i 



r 2 can be dropped from the expression unless 



k = 0; for standard regression it may be ignored so long as a proper prior is used for (3 k . 

A uniform prior over all models with k nonzero components is typical. Often, n(k) oc 
l,Wk G {1, . . . ,p*}. However, we prefer to take k ~ Bin(j>*,7r), with ix G (0,1) where n 
controls the "sparsity" , and p* denotes p or min{j9, n — 1} for compactness. Prior information 
on 7r may be interjected either by fixing a particular value, or b y taki n g a h ierarchical 



approach with ir ~ Beta(g, h) (e.g., iGeorge and McCullochl . Il993l ). iHand (120081 ) used g = 
h = 1 for with a Laplace prior in the product space. The posterior conditional for GS is 
ir\k ~ Beta(g + fc, g+p* — k). In our transdimensional approach, we choose a uniform proposal 
for the valid jumps. For a "birth" we take q(0 — > 1) = 1/p, and q(k — > k + 1) = l/2(p — k) 
for k = l,...,p* — 1. Conversely for a "death" we take q(p* —> p* — 1) = 1/p* and 
q(k — > k — 1) = 1/2/c for fc = p* — 1, . . . , 1. Otherwise q(k — > k') = 0. 

Movement throughout the 2 P sized space will be slow for large p, so a certain amount of 
thinning of the RJ-MCMC chain is appropriate. Collecting a sample from the posterior after 
p transdimensional moves approximates the model-level mixing (and computation burden) 
of the product-space approach. Throughout the RJ-MCMC the length of f3 varies, and the 
components shift to represent the partition of X stored in the columns of Therefore, 
post-processing is necessary if samples of (3 are to be used elsewhere, e.g., in 0-space via 
(pj), where a full p- vector having zero and nonzero entries in the correct positions is needed. 
This may be facilitated by maintaining a /c-vector of column indicators. The posterior 
probability that variable j, j = 1, . . . ,p, is relevant for predicting y is then proportional to 



where T is the number of samples saved from the Markov chain. 



3.3 Student-^ errors via scale-mixtures 

The MVN assumption is not always appropriate. We may wish to consider the possibility 
that errors in y have a Student-t distribution with an unknown degrees of freedom v: 



y = (3 l n + X/3 



{e^i-St^V). 



(10) 



Following ICarlin et all fll992h and iGewekd fll993f ) we shall represent the Student-t distribu- 
tion as a scale mixture of normals with an IG(^/2, vj c £) mixing density. 

We must redefine X = (l n , X) asanx (p + 1) matrix, (3 = (/3 , /3 T ) T = {(3j}*j =0 so that 
the model in Eq. (jTQl becomes y = X/3 + e since the posterior intercept (3o is no longer 
independent of the other components of f3 in the presence of heavy-tailed errors. The setup 
is otherwise unchanged from Section |3~T1 Upon assuming an exponential prior for the degrees 
of freedom parameter, z/, the modifications to the hierarchical model in Eq. 



are: 



y |X, /3, a 2 , H 2 }? =1 ~ A4(X/3, <r 2 D w ) 
(3\a 2 ,{rf} p J=1 ^Af p+1 (0,a 2 B T ) 

u 2 \u ~ IG(i//2, u/2) 



D w = diag(w 2 , 

D r = diag(oo, r 2 , . . . 

u\9 ~ Exp(9). 
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Note that D r is a p + 1 diagonal matrix, and that the first component insures that (3$ is 
given a flat prior as before. After redefining A = X T D~ 1 X + D" 1 ,^ = A~ 1 X T D~ 1 y and 
"4>p = (y ~~ ^/3) T Dj 1 (y — X/3) + ^"""D" 1 ^, the modified full posterior conditionals follow: 

f3W 2 , {rf}U, H 2 }r=i, y ~ A/; +1 (A a 2 A- 1 ) 



(12) 



^|/3,{r^ =1 ,{^}r =1 ,y~IG 



+ ra + p b a + ipf3 



iid 



, z/,y ~ IG 



1 i/ 



(T 



2 

- 2 ((y 



u ' 



oc 



r 



exp(— r]v) 



where rj = \ EILi(M^ 2 ) + + #0 

The conditional posterior of v does not correspond to a standard distribu tion, however a 
convenient rejection sampling method (with low rejection rate) is available ( iGewekd . 119921 ) 
using an exponential envelope. The optimal scale parameter v* can be chosen to minimize 
the unconditional rejection rate by finding the root of (n/2)[log(^/2) + 1 — ^(z//2)] + z/~ x — rj, 
where \1/ is the digamma function. Standard Newton-like methods work well. A draw from 
v ~ Exp(z/*) may then be retained with probability^ 

n/2 

exp[(^-^)(K)- 1 -r / 



mm 



r(i/*/2) 



L r (^/ 2 ) J 



[v/2f 



_(v*/2)»*_ 



As before we may integrate out /3 obtaining a |{r; \} P j=ii {. u i}i=ii Y ~ IG((a CT + n — 

y T D- 1 y-3 T A^. 
via Eq. (Q, may 
XlD^X.+D 



l)/2, (6 CT +^)/2) by redefining^ = (y-X/3) T D~ 1 ( y -X/3)+/3 ' 
Finally, the Bayesian model selection and averaging method of Section 
be used with X fc = (l n , X fc ), (3 k = (A,, /3j) T , fa = A^XjDjV and A 
and D Tfc = diag(oo, rf , . . . , r|). The number of latent variables now grows with the sample 
size, so automatic 0(n) thinning from the Markov Chain is sensible. 



Tk 



3.4 Empirical results on detecting fat tails 

Hansl (120081 ) and I Griffin and Brownl t()l(h offer a plethora of insights about the Bayesian 



lasso and NG with comparison to the classical lasso. There is no need to re-produce these 
results here. Instead we offer a demonstration of the Student-t extensions that are unique to 
our setup and relevant in light of the recent criticism of MVN in the financial press. Basically, 
we explore the extent to which deviations from normality may be detected by testing the 
null hypothesis (model .Mn) of normal errors versus the alternative (model Aist) that they 
follow a Student-t with v unknown. One way to do this is via a posterior odds ratio (POR): 



p(-M N |y) ti(M n ) p(y\M 



x 



p(M st \y) Tr(-Mst) p(y\M 



St J 



[prior ratio] x [Bayes factor] 



1 Note that there is a typo in the conditional for v provided by Gewekd ( 1993 ). 
2 There is also a typo in the acceptance probability provided by Gewekd (1992). 
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where 7r(.M*) is the prior on .M*, and p(y\M.*) is the marginal likelihood for .M*. By taking 
equal priors we may concentrate on the Bayes factor (BF). 

Calcula t ing P ORs and BFs can be difficult in generality; for a review of related methods 



sec 



Godsilll (j200lf ). However, we may ex ploit that the Student -t and normal models differ by 



just one parameter in the likelihood, v. iJacquier et al.l ( 120041 . Section 2.5.1) show that this 



BF may be calculated by writing it as the expectation of the ratio of un-normalized posteriors 
with respect to the posterior under the Student-t model. That is, we may calculate 

E f 1 if pW">,Mn) where (Hfi^M^MA, 

\p(y\xl>,v,M st )\ r^p(y|^W, v W,A<st) 

and where if) collects the parameters shared by both models. 

To shed light on the "selectability" of the Student-t model (ITUj) . consider synthetic data 
where (3 = (2, —3, 0, 0.75, 0, 0, — 0.9) T , ji = (3 = 1, the rows of the n x 7 design matrix X 
are uniformly distributed in [0, l] 7 , and 6i ~ St(0, a 2 = 1; z/), for i = 1, . . . , n. We perform 
a Monte Carlo experiment where n and v vary, with n e {30, 75, 100, 200, 500, 1000} and 
v G {3,5,7,10,oo}, and consider the frequency of times that the BF indicated "strong" 
preference for the correct model in repeated trials. In each trial, GS (fT2"j) was used to obtain 
1200 samples from the posterior by thinning every In rounds, with the first 200 discarded 
as burn-in. For n < 200 we repeated the experiment with random data 300 times; when 
n = 500 we used 50 replications; and when n = 1000 we used 20. 
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Figure 3: Frequency of correct model determinations as a function of the sample size, n, and 
the degrees of freedom parameter, u, where "normal" is interpreted as v = 00. 

Figure [3] shows the relationships between n, v and the frequency of correct model deter- 
minations (higher frequencies are better). In the case of normal errors, and Student-t errors 
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with v — 3, the correct model can be determined with high accuracy when n > 200. When 
v = 5 a sample size of n = 1000 is needed; when u = 7,10 we need n ^> 1000. Clearly for 
10 < v < oo the situation is hopeless unless n is very large. These results have implications 
in the context of our motivating financial returns data in Section [51 indicating that long 
return histories may be required to benefit from relaxing the MVN assumption. 



4 Bayesian inference under monotone missingness 

Here we collect ideas from the previous sections in order to sample from the joint posterior 
distribution of 6 = (/x, S). Let </> ■ = (Ajjj Ap cr?) ~ i37£j = B1Z(Yj,yj) represent samples 
collected from the posterior of the chosen 0-space (shrinkage) regression models — one of the 
ones from Section [3J Then, following Eq. (TTJ from Section [21 samples from the posterior 
distribution of may be obtained by repeating the following steps. 

1. Sample (fii, = (A),i, of) ~ jB72-i. See below for details of this special case. 

2. For j = 2, . . . ,m: 

(a) Sample (/3 j, cr?) ~ S7£j-. 

(b) Convert (/i i; S 1:jj ) = $ _1 (/3 j, of, following Eq. ©. 

Since the Bayesian regressions (BTZj) are mutually independent, step 1 and the j — 1 steps 
of 2(a) may be performed in parallel, and the conversion via $ _1 in 2(b) may be performed 
offline. Zeros in /3 ■ may translate into zeros in which may be used to test hypotheses 

about the marginal and conditional independence between assets. 

In the default formulation of BTZj with standard normal errors (jlj) we use Yj = Yfe_iv 
In this case the first step above (j = 1) simplifies to: 



ni - 1 b a + ||yi 



1 2 



£ lt i~IG ~ » n > then Hi~N(yi,Zi,i/n). (13) 



If heavy-tailed errors are modeled [Section 13TB] for the </>-space regressions (BTZj), then take 

0:(j-l)- 



Yj = Y^V.^. In this case the first step above (j = 1) requires integrating over {k^}^. 



Conditional on a particular D Wl = diag^^, . . . , x ) sampled from their full conditional 
under BTZ\ (conditional on V\ in Eq. ( [12]) which must also be integrated out), we may sample 

£,.~IG( a ' + r^' + y ^' yi y then (14) 



where n m = Y^i=i Carrying this through in the second step, above, for j = 2, . . . , m, 
etc., integrating over independent (u 2 , {w^}™^), . . . (v m , {^ m }£Ti), is similarly parallelizable. 
The marginal Student-t error structure for y^- carries over into 6j-spa.ce giving a distinct 
degrees of freedom parameter Vj for each (marginal) yj, for j = 1, . . . ,m. So the resulting 
model in 0-space is not the typical multivariate Student-t which has a single v. 
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This more standard multivariate Student-t model may be obtained by modifying the 
prior so that all ujfj depend on a common v. I.e., 

WqXP^o^v ~ IG I — — , I, 2 = l,...,n i; j = l,...,m 

(using intercept-extended Yj and Then, the full conditional of v becomes 

p{v\{{<4}Zi}?=*,0) (r (0) -E ^ nJ ex P (-^), 

where 77 = | X^j=i X^iOog^fj) + w ij 2 ) + ^- The same rejection sampling method (i.e., 
using n = X]j=i n i) applies. Although we may proceed with sampling from BTZj using 

Uj = z/® ignoring the full conditional for uj, the independence between the BIZj is now 
broken: parallelization of the MCMC is bottlenecked by sampling the common v. 

The remainder of the section covers the extensions particular to the portfolio balancing 
problem: handling "gaps" , incorporating known factors, and accounting for estimation risk. 

4.1 Dealing with "gaps" by monotone data augmentation 

Data augmentation (DA) is an established (Bayesian) technique for dealing with missing 
data. In short, it involves treating the unknown portion of the data as latent variables 
and updating, or imputing, their values jointly with the othe r unknown (mo del) parameters 



via the posterior predictive. For a high level overview see I S chafer! (119971 . Section 3.4.2). 
Rather than treat all of the missing data as latent in this way, it is sufficient impute a small 
portion to achieve a monotone missingness pattern. Then, inference ma y proceed as a l ready 



described. This is known as monotone data augmentation (MDA) (ILil . Il988l : ISchaferl . 119971 . 
Section 6.5.4). In the case of the financial returns data Y at hand, with sorted columns and 
rows, the appropriate candidates for imputation are easily spotted. 

Consider each y^j = NA such that there exists a yiji 7^ NA where j' > j. Specially mark 
these with y i; j = NaN, say, as these are the entries that must be treated as latent. There 
will not be any if the pattern is monotone. Then sort the rows of Y by the number entries 
which are (still) NA so that those with more NAs appear towards the bottom of Y. Define 
,j = X^ILi ^{wj^NA} as before, but now its interpretation is as the number of observed entries 



n 



in the j th column, plus the number which are treated as latent. Some entries of Yj and y^, 
defined as before, may contain NaNs. However, note that y n ,j 7^ NaN by construction. 

Let Yj index the rows of column j of Y such that y_j[rj] = NaN. When sampling from 
BIZj in step 2(a), above, ignore the rows of Y^ and y^- in r,-, but otherwise proceed as usual. 
Then, add a step, 2(c), wherein for i G ij 

2(c): Vij ~ AftPjYi^.j, of), 

where Yj^.j is the row vector containing the i th row of Yj. Observe that the mutual in- 
dependence of the BTZj is broken by this MDA. They must be processed in serial so that 
BHj+i(Y j+i, Yj+i) may proceed with an up to date copy of Y. 
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4.2 Incorporating known factors 

A popular way of developing an estimator of the covariance matrix of financial asset returns 
is via factor models. The idea is that certain market-level indices, like the value-weighted 
mark et index, the size of the firm associa t ed wi th the asset, and the book-to-market factor 



e.g., IChan et al.l . Il999l ; iFama and French! . Il993l ) provide a good basis for describing individ- 



ual returns. Importantly, these factors are easy to calculate as a function of readily available 
"fundamentals" (characteristics of the listed assets and companies) and the stock returns. 
Through covariances calculated between the factors and individual asset returns we may 
infer covariances between each of the assets. For a n x K matrix of (known) factors F, where 
K <C mm{n,p} and where the factors have covariance f2, the factor model approach poses 
the following model for the returns Y via columns j = 1, . . . , m: 

Yj = Vj + F j^j + e j, where e,- ~ Af n . (0, <r|l nj ). (15) 

(n ■) 

Take Fj = F^.^_ 1 ^ ) , i.e., without a column of ones, and treat the regression coefficients Xqj, Xj 
as unknown. If A is the K x m matrix defined by collecting the Xj column-wise, then a 
covariance matrix on the returns Y may be obtained as 

S (/) = A T f2A + D CT , where D CT = diag(of, . . . , a 2 m ). 

* if) 

The MLE S ' may be obtained via the standard estimate of ft and the MLEs {A,,}™^. 
Similarly, one may sample from the Bayesian posterior with suitable (non-informative) in- 
dependent priors on f2, and {Aoj, Xj, aj}™^. 

The estimators of obtained in this way tend to have low variance, which is a desirable 
property. However, they also have a strong bias, which may be undesirable. The bias stems 
from an implicit assumption that the returns are mutually independent when conditioned 
on the factors. Not only might this not be a reasonable assumption, but it also makes the 
quality of the resulting estimator(s) extremely sensitive to the choice of factors. This bias 
may be mitigated to some extent by further involving the return s in the estimation proc ess, 

i.e., in a more direct way. One such approach, considered by iLedoit and Wolf d2002h . is 

* (/) 

to take a convex combination of a factor-based estimator (X ) and a standard (possibly 

* (c) 

non-positive definite) complete data estimator (S ): 

S W =aS (/) + (l-a)S (c) , for a G [0,1]. (16) 

The mixing proportion, a, may be determined by CV. That this approach works well is a 
tes tament to the importa nce of combining a factor model with a more direct approach. 



Gramacy et al.l (120081 ) described hybrid method of incorporating the factors into the 
0-space procedure of the monotone factorization MLE so the data may inform on which 
independence assumptions are adequate. Consider the combined regression model: 

y, • yjij ■ v x, ■ < r (17) 
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Observe that the A ,j term present in Eq. (JTSjl has been dropped because it is not identifiable 
in the presence of (3 j. With some bookkeeping, the model described in Eq. ([17]) can be used 
to obtain a joint estimator for a m + K element mean vector and (m + K) x (m + K) 
covariance matrix from which \i and X may be extracted. To fix ideas, suppose that the 
factors are completely observed, which is usually the case. Then we may sample from the 
regression models in BlZ(\Fj Yj], y^) in 0-space, and after transformation to 0-space via $ _1 
in Eq. (JTJ the latter m components fJ>(K+iy.(m+K), an d m rows/cols ^(K+iy.(m+K),(K+iy.{m+K) : 
may be extracted as a sample of the mean and covariance of the returns. Shrinkage (or 
model averaging) in the regression model enables the columns in [Fj Y^], be they factors 
or returns, that are least useful for predicting yj to be down-weighted. That way, rather 
than having one parameter governing the trade-off, like a in Eq. (fT6|) . the m — 1 Bayesian 
shrinkage regressions can choose the right balance of factors and returns for each asset. Prior 
knowledge that many assets will be independent when conditioned upon appropriate factors 
may reasonably translate into small it (controlling the level of sparsity) encoding a preference 
for a small proportion of nonzero components of (3 in the 0-space regressions. 

4.3 Balancing portfolios and accounting for estimation risk 

A portfolio is balanced by choosing m weights w describing the portion of the portfolio 
invested in each asset. A standard technique uses the mean and t he covariance between 



returns to obtain a mean-variance efficient portfolio (lMarkowitzl . ll959l ) by solving a quadratic 
program (QP). Common formulations include the following H 

A so-called minimum variance portfolio may be obtained by solving 

argminw T Sw, subject to w T l = 1. (18) 

w 

Typical extensions include capping the weights, e.g., < Wj < 2/m, for j — 1, . . . , m. 

The above formulation may be augmented to involve the estimated mean return. One 
way is to aim for a minimum expected return \i while minimizing the variance of the portfolio: 

argminw T Sw, subject to w T ^ > /i, (19) 

w 

and w T l = 1. 

Similar heuristic augmentations apply here as well. A common extension is to assume that 
there is a risk- free asset available, e.g., a Treasury bond, at rate of return Rf. Then the 
constraints may be relaxed to/Lt>/i+(l — w T l)i?/ > /i and w T l < 1. 

Given fi and S the solutions to t hese QPs, which are strictly convex, are essentially 
trivial to obtain. Gramacy et al. ( 20081 ) showed that when shrinkage-based MLEs fi and S — 



constructed from all available returns via the monotone factorized likelihood — are used to 
balance portfolios in this way, they outperform a wealth of alternatives based upon estimators 
that could only use the completely observed instances. In the Bayesian context there are 



3 In all cases we have the tacit constraint that < Wj < 1, for all j = 1, 
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several ready extensions to this approach. For example, the MAP parameterization can 
be used to balance the portfolio. Another sensible option is to use the posterior mean 
of /x and X, which we show empirically leads to improved estimators of a true (known) 
generating distribution [Section [4.4] , and to improved portfolios [Sect ion But the portfolio 
balancing problem is about choosing at time t, say, weights to maximize expected return 
and/or minimize variance under the posterior predictive distribution p(y(' +1 ) |Y^). Here 
Y(*) = Yi :n : represents the returns available up to time t, and y(* +1 ) is the vector of returns 
at time t + lj^ Parameter un c ertainty (a.k.a., estimation r isk) is taken into account by 
integration ( Zellner and Chetty . 1965 ; Klein and Bawa . 19761 ): 



p(y( t + 1 ^\Y^) = J p(y (m V> E)p(/Lt, E|Y®) dfidH. 



(20) 



Calculating this integral (or its moments, for use in the QP) is not, in general, tractable. 
However, with i.i.d. MVN returns (or another elliptical distribution) the moments are easily 



obtained (jPolson and Tew 
E{S|YW} + Var{/x|Y«}, i.e. 



2000h as = E{y(* +1 )|YW} = E{fi\Y^} and £ (m) 



via a conditional variance identity. 
In the case of completely observed returns, a standard non- informative prior p(fjL, S 

Poison and Tewl (l2000h 



oc 



' m + l ' 



and (importantly) more returns than assets (n > m) 



show that there is "no effect of parameter uncertainty on the portfolio rule" since = fi 

and £(* +1 ) = cS, where the constant c is available in closed form and is a function of n and 



m onl y. In the case of historical returns of varying length, and when n 3> m, IStambaugh 



(119971 ) shows that we again have that fx 



(t+i) 



fi, and that is available in closed 

form but is not a scalar multiple of X. It can be shown empirically that incorporating this 
parameter uncertainty (a.k.a., estimation risk) leads to improved investments. 

We are motivated by the situations in which these analyti cal approaches do not apply, i.e., 
when m > n or when rij < j for any j — 1, . . . , m. Although iGramacy et al.l (120081 ) extended 
the MLE approach to the rij < j setting by employing parsimonious regressions, accounting 
for estimation risk remained illusive. The problem is best exposed in the calculation of 
Stambaugh's V in Eq. (69-71), pp. 302, where the resulting diagonal is negative when m > n. 
But under the fully Bayesian approach, where samples of /x and X may be taken from the 
posterior, the above conditional variance identity may be used to approximate with 
arbitrary precision. 



4.4 Empirical results and comparisons 

As a first point of comparison we pit the MLE based point-estimates of /x and S against 
the Bayesian alternative: posterior expectations. We simulated synthetic data from known 
fj, and S, imposed a uniformly random monotone missingness pattern, and then calculated 
the expected (predictive) log likelihood (ELL) of the so-parameterized MVN distribution(s). 
The ELL of data sampled from a density p relative to a density q (usually estimated) is 
given by E p {logg} = f p(x) logg(x) dx = H{p) — D KL (q \\ p), where H(p) = f plogp is the 

4 The i.i.d. assumptions erode the meaning of time. Notionally, t runs counter to i = 1, . . . , n. 
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entropy of p, and D Kh (q \\ p) is the Kullback-Leibler (KL) divergence between q and p. The 
entropy and KL divergence are known in closed form for MVN densities p and q. When q 
uses point-estimates (/x, E), and p uses the truth (/x, E), the ELL is given by: 

1 . r ,_ 1 A IEI . ,„ ,TA-1. 




--log{(27r e ) w |S|}--|log^+tr(S E) + (/x - /x) 1 £ (A — A*) ) - (21) 
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Table 1: Summary of the rankings (by ELL (12 of fifteen parsimonious regression methods 
used in the monotone MVN algorithm on 50 randomly generated MVN parameterizations, 
via two data generation "methods" . The best ranks appear in green; worst in red. 

Table [T]contains summary information about the relative rankings of the ELL calculations 
for nine Bayesian and nine MLE estimators in a series of 50 repeated experiments. The results 
in the top portion of the table are for random MVN parameterizations that were obtained 
using the randmvn function from the monomvn package, with argument method=" normwish", 
and m = n = 100. Uniform monotone missingness patterns are then obtained with rmono. 
The use of parsimonious 0-space regressions (lasso, NG, ridge) with model selection via RJ- 
MCMC is determined by the setting of S G [0, 1). A parsimonious regression is performed 
if Srij < j, and OLS is used otherwise — OLS regression is never used when 5 = 0. Observe 
from this portion of the table that the Bayesian estimators always obtain a better rank than 
the MLE ones. In the Bayesian case, ridge regression nudges out the lasso/NG. This makes 
sense because the data generation method ("normwish") does not allow for any marginal or 
conditional independencies, i.e., zeros in £ or S" 1 . The situation is reversed for the MLE, 
where the lasso comes out on top. In all cases but the MLE/CV lasso implementation, a 
lower value of 5 gives improved performance, indicating that parsimonious regressions yield 
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improvements even when they are not strictly necessary However, lower 5 comes with a 
higher computational cost. Given that the improvement of 5 = over 5 = 0.2 is slight, the 
higher setting may be preferred. Finally, observe that the Bayesian lasso edges out the NG. 

The results of a similar experiment, except using the argument method="parsimonious" 
to randmvn, are summarized in the second portion of the table. This data generation mech- 
anism allows for conditional and marginal independence in the data by building up fi and 
5] sequentially, via randomly generating /3 - (possibly having zero-entries) and applying 
as in Eq. ([T]). Here the number of nonzero entries of (each) (3j follows a Bin(j, 0.1) distri- 
bution. In this case the results in the table indicate that the lasso/NG is the winner. The 
distinction between Bayesian and MLE/CV methods is as before. This time, NG edges out 
the Bayesian lasso. Finally, the bottom two portions of the table report ranks for the same 
two experiments described above, but using n = 1000 so that parsimonious regressions are 
needed less often. These results are similar to the n = 100 case except that the distinction 
between the Bayesian and MLE/CV ranks is blurred somewhat. 



5 Asset management by portfolio balancing 



Here we return to the motivating asset management problem from Section [TJ We examine the 
characteristics of minimum variance portfolios f fl~9]) constructed using estimates of S based 
upon historical monthly returns through a Monte Carlo experiment of repeated inv e stmen t 



exercises. Th e experimental setu p closely mirrors the one used by iGramacy et al.l ( 120081 ) 



modeled after IChan et al.l (1l999h . The data consist of returns of common domestic stocks 
traded on the NYSE and the AMEX from April 1968 until 1998 that have a share price 
greater than $5 and a market capitalization greater than 20% based on the size distribution 
of NYSE firms. All such "qualifying stocks" are used — not just ones that survived to 1998. 
Since the i.i.d. assumption is only valid locally (in time) due to the conditional heteroskedastic 
nature of financial returns, estimators of XI are constructed based upon (at most) the most 
recently available 60 months of historical returns. Short selling is not allowed; all portfolio 
weights must be non-negative. Although practitioners often impose a heuristic cap on the 
weights of balance d portfolios, e.g., at 2%, in order to "tame occ asional bold forecasts" 



( Chan et al.l . Il999l ) or to "curb the effects ... of poor estimators" ( IJagannathan and Ma . 



20031 ). we specifically do not do so here in order to fully expose the relative qualities of the 
estimators in question. 

Our analysis in Section 13.31 suggests that the benefit of modeling Student-t errors will 
lead to minor improv ements in our estimators based upon just 60 or fewer historical returns. 
Jacquier et al.l ( 120041 ) showed that Bayes factors give strong preference to (the simpler) nor- 
mal model over the Student-t at return frequencies less than weekly — supporting the aggre- 
gation normality of monthly returns. Models with Student-t errors were included in initial 
versions of our exercise, but they performed no better than their normal counterparts. This, 
and the extra computational burden required by the extra 0(n) extra latent variables, led 
us to exclude the Student-t comparators from the experiment reported on below. 

The Monte Carlo experiment consists of 50 random repeated paths through 26 years, 
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starting in April 1972. In each year 250 "qualifying stocks" with at least 12 months of 
historical returns are chosen randomly without replacement. Using at most the last 60 
returns of the 250 assets, estimates of the covariance matrix X of monthly excess returns 
(over the monthly Treasury Bill rate) are calculated under our various methods and used to 
construct minimum variance portfolios. The port folios are then held (fixed) for the year. To 
assess their quality and characteristics we follow Chan et al.l ( 1999 ) in using the following: 
(annualized) mean return and standard deviation; (annualized) Sharpe ratio (average return 
in excess of the Treasury bill rate divided by the standard deviation); (annualized) tracking 
error (standard deviation of the return in excess of the S&P500); correlation to the market 
(S&P500); average number of stocks with weights above 0.5%. Generally speaking, portfolios 
with high mean return and low standard deviation, i.e., with large Sharpe ratio, are preferred. 
Sharpe ratios being roughly equal, we prefer those with lower tracking error. 
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0.519 


0.117 


0.685 


21 


ridge 


0.158 


0.165 


0.540 


0.122 


0.717 


18 


bridge 


0.140 


0.129 


0.554 


0.089 


0.829 


27 


lasso 


0.149 


0.149 


0.543 


0.054 


0.940 


69 


blasso 


0.144 


0.136 


0.561 


0.078 


0.871 


39 


bng 


0.144 


0.136 


0.560 


0.078 


0.872 


39 


fridge 


0.158 


0.164 


0.549 


0.121 


0.719 


19 


bfridge 


0.142 


0.129 


0.571 


0.085 


0.846 


34 


flasso 


0.150 


0.148 


0.552 


0.056 


0.935 


69 


bflasso 


0.148 


0.138 


0.573 


0.070 


0.898 


51 


bfng 


0.148 


0.138 


0.574 


0.071 


0.896 


50 



Table 2: Comparing statistics summarizing the returns of yearly buy-and-hold portfolios 
generated over 50 repeated random paths through the 26 years of monthly historical returns. 

Table [2] summarizes the results. It is broken into five sections, vertically. The first section 
gives results for the equal- and value-weighted portfolios. The second section uses standard 
estimators of S based only upon the complete data. The "min" estimator uses only the last 
12 months of historical returns, whereas "com" uses the maximal complete history available. 
Both use standard estimators (i.e., via the cov function in R). The "rm" estimator is similar 
but discards any assets without the full 60 months of historical returns. These three rows in 
the table highlight that the more historical returns (within the five-year window) that can 
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be used to estimate covariances the better. The third section, containing the same acronyms 
with a leading "f", incorporates the value- weighted factor on same subset of returns. The 
improved characteristics in the table show that good factors can be quite helpful. 

The results are further improved when the estimators exploit the tractable factorization of 
the likelihood under the monotone missingness pattern using shrinkage regression (5 = 0.2), 
as shown in the penultimate section of the table. Notice that the Sharpe ratios indicate 
that the fully Bayesian estimators (with a "b" prefix) outperform the classical alternative, 
and that the lasso/NG methods are better than the ridge. In addition to fully accounting 
for all posterior uncertainties, the Bayesian estimators have the advantage of being able 
to deal with "gaps" in the data, via MDA [Section 14.1] , and can account for estimation 
risk [Section 14. 3j . Observe that these estimators distribute the weight less evenly among 
the assets, having fewer assets with > 0.5% of the weight on average compared to their 
classical counterparts. The higher concentration of weight on the appropriate assets leads to 
lower variance portfolios which deviate further from the market, hence the somewhat higher 
tracking error. The results for the lasso are nearly identical to those under the extended NG 
formulation. 

The final section of the table shows that incorporating the value-weighted factor (now 
with 5 = 0) leads to further improvements. A sensible (prior) belief that the presence of a 
good factor causes many pairs of assets to be conditionally independent [Section 14.3] allows 
us to dial down the hierarchical prior on the proportion of nonzero regression coefficients: 
7r ~ Beta(l, 100). The results in the table suggest that the factor causes weight to be more 
evenly distributed in the case of the Bayesian estimators, but not the classical ones. 

The first eight rows (first three sections) of the table, and those correspondi ng to "lasso" , 



'ridge " , "flasso" and "fridge" , are nearly identical to ones in a similar table from lGramacy et al. 



( 120081 ). Any variation is due to different random seeds. This calibration allows us to draw 
comparisons to the other classical monomvn estimators including ones based upon PCR, etc. 
In short, the fully Bayesian approach(es) reign supreme. The improvements may appear 
to be modest at a glance. But in light of the fact that financial markets are highly un- 
predictable they are actually quite substantial. As a matter of curiosity we also calculated 
statistics under the posterior mean parameterization, i.e., without accounting for estimation 
risk. The Sharpe ratios were: 0.549 (0.562 with the factor) under the ridge, 0.554 (0.562) 
under the lasso, and 0.553 (0.563) under the NG. These numbers point to an improvement 
over the classical approach using the posterior mean, but indicate that the incorporation of 
estimation risk is crucial to get the best portfolio weights. 

Figure H] summarizes the variability in the Monte Carlo experiment showing the distri- 
bution (via boxplots) of the Sharpe ratios and tracking error obtained for each of the 50 
random paths through the 26 years, thereby complementing the averages presented in Table 
[2J We can immediately see that the classical ridge regression approach is highly variable, 
often yielding extremely low Sharpe ratios and high tracking error. The Bayesian approach 
offers a dramatic improvement here. In the case of the lasso/NG we can see that the vari- 
ability of the Sharpe ratios for the Bayesian implementations are higher than their classical 
counterparts. However, it is crucial to observe that the boxplots extend in the direction of 



20 




Figure 4: Boxplots of Sharpe ratios (top) and the tracking error (bottom) obtained over 50 
random paths through 26 years. The vertical bars correspond to horizontal ones in Table EJ 



larger Sharpe ratios, offering improved estimators. 



6 Discussion 



We have shown how the classical (MLE/CV) shrinkage approach of iGramacy et al.l (120081 ) to 
joint multivariate inference under monotone missingness may be treated in a fully Bayesian 
way to great effect. The Bayesian approach facilitates extensions to deal with "gaps" in 
the monotone pattern via MDA and heavy-tailed data, and can account for estimation risk. 
None of these features could be accommodated by the classical approach. In synthetic data 
experiments we demonstrated the descriptive, predictive, and inferential superiority of the 
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Bayesian methods. Using real financial returns data we showed how the fully Bayesian 
approach leads to portfolios with lower variability and thus higher Sharpe ratio. 

O ur methods b ear s ome similarity to other r ecent approaches to covariance estima- 



tion. 



Levina et al.l (120081 ) and ICarvalho and Scottl (120091 ) offer priors on covariance matrices 
with shrinkage based on Cholesky decompositions. Our monotone factorization (originally: 



Andersen! . 119571 ) can be seen as a special case where the column order, and thus the underly- 
ing graphical model structure, is fixed by historical availability. This has certain advantages 
in our context (relative inferential simplicity, tracta bility , MDA and heavy tail extensions), 
but it may not be optimal in complete data cases. iLiul ( 119961 . Il995h provides a model for 
Bayesian robust multivariate joint inference for data exhibiting a monotone missingness pat- 
tern. The posteriors are derived using extensions of Bartlett's decomposition, and "gaps" 
may be handled via MDA. Although a wealth of theoretical and empirical results are pro- 
vided, it is not clear how the methods can be adapted t o "big p sma ll n" setting. A possible 
way forward involves Bayesian dynamic factor models ( jWestl . 120031 ) which are designed for 
"big p small n" and retain the ability to handle missing data in a tractable way. 

It is becoming well understood that the Laplace prior has many drawbacks, even when 
generalized by the NG. For example, it is known to produce biased estimates of the nonzero 
coefficients and to underestimate of the number of zeros J A newly developed shrinkage 



prior for regression called the horseshoe (ICarvalho et al.l . 120081 1 shows promise as a tractable 
alternative (i.e., via GS) without the bias problems. Incorporating horseshoe regression 
in our framework is part of our future work. Another obvious extension is to relax the 
i.i.d. assumption to obtain more dynamic estimators. One possible approach might be to 
use weighted regressions for the BIZ's with weights decaying back in time. 
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